3B: Importing Data Frame

Readings

From R Coding Basics: An Introduction to the Basics of Coding in R by Dr. Gaston Sanchez:

Topics

  • Built-in data sets

  • Importing data

  • Inspecting a data frame

  • Storing data and objects

Built-in data frames

R has many built-in data sets:

  • faithful Old Faithful Geyser Data
head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
  • iris Edgar Anderson’s Iris Data
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
  • mtcars Motor Trend Car Road Tests
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
  • USArrests Violent Crime Rates by US State
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

Reading data from the clipboard

A data frame can be made by reading from the clipboard as follows:

  • Select and copy the content from a data file

  • Use the read.delim() function

read.delim('clipboard')

Importing data with read.table() and read.csv()

  • Small and medium-size data sets are usually stored in a text file .txt

  • Data in a text file are separated by a delimiter such as white space, comma, tab, or semi-colon.

  • A text file in which data are separated by a comma is equivalent to a comma-separated value file .csv, which is commonly used in statistics and data science due to portability.

### comma-separated value

floor,bed,lease,pool,rent
first,0,9,yes,1200
second,1,12,no,1550
first,2,12,yes,1900
second,1,9,no,1650
first,0,12,yes,1300
  • The read.table() function is a general function to read data from a text file. Before using the function, it is best to move the text file under the current working directory.
read.table('file_name.csv', header = TRUE, sep = ' ')

read.table('file_name.csv', header = TRUE, sep = ',')

read.table('file_name.csv', header = TRUE, sep = '\t')

read.table('file_name.csv', header = TRUE, sep = ';')
  • The read.csv() function is just a wrapper of the read.table() function for importing comma-separated value file .csv
read.csv('file_name.csv')    # same as read.table('file_name.csv', sep = ',')

💻 Hands-On

Import the apartment data.

The below code assumes that we have apartment.txt and apartment.csv in our current working directory. To identify our current working directory, type in the console getwd().

read.table('apartment.txt', header = TRUE, sep = ';')

read.csv('apartment.csv')

Inspecting a data frame

Given a data frame,

  • str() shows its overall structure

  • head() returns its first few rows

  • tail() returns its last few rows

  • dim() returns a vector dimensions

  • nrow() returns its number of rows

  • ncol() returns its number of columns

  • colnames() and names() returns its column names

  • rownames() returns its row names

  • summary() shows descriptive statistics for its columns

  • complete.cases() returns a logical vector indicating which observations have no missing values.

💻 Hands-On

Inspect the iris data.

# overall structure
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# first few rows
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
# last few rows
tail(iris)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
# vector of dimension
dim(iris)
[1] 150   5
# number of rows
nrow(iris)
[1] 150
# number of columns
ncol(iris)
[1] 5
# variable names
colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
# row names if any; iris does not have special row names
rownames(iris)
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12" 
 [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
 [25] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36" 
 [37] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48" 
 [49] "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60" 
 [61] "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72" 
 [73] "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84" 
 [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96" 
 [97] "97"  "98"  "99"  "100" "101" "102" "103" "104" "105" "106" "107" "108"
[109] "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120"
[121] "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
[133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143" "144"
[145] "145" "146" "147" "148" "149" "150"
# descriptive statistics for each variable (column)
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                
# TRUE if an observation has no missing values; iris has no missing values
complete.cases(iris)
  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[136] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

💻 Hands-On

Import and inspect the loan50 data from OpenIntro.

This data set represents 50 loans made through the Lending Club platform, which is a platform that allows individuals to lend to other individuals. Of course, not all loans are created equal. Someone who is a essentially a sure bet to pay back a loan will have an easier time getting a loan with a low interest rate than someone who appears to be riskier. And for people who are very risky? They may not even get a loan offer, or they may not have accepted the loan offer due to a high interest rate.

  • state - Two-letter state code.

  • emp_length - Number of years in the job, rounded down. If longer than 10 years, then this is represented by the value 10.

  • term - The number of months of the loan the applicant received.

  • home_ownership - The ownership status of the applicant’s residence.

  • annual_income - Annual income.

  • verified_income - Type of verification of the applicant’s income.

  • debt_to_income - Debt-to-income ratio.

  • total_credit_limit - Total available credit, e.g. if only credit cards, then the total of all the credit limits. This excludes a mortgage.

  • total_credit_utilized - Total credit balance, excluding a mortgage.

  • num_cc_carrying_balance - Number of credit cards that are carrying a balance.

  • loan_amount - The amount of the loan the applicant received.

  • grade - Grade associated with the loan.

  • interest_rate - Interest rate of the loan the applicant received.

  • public_record_bankrupt - Number of bankruptcies listed in the public record for this applicant.

  • loan_status - Status of the loan.

  • has_second_income - Whether the applicant has a secondary income.

  • total_income - The applicant’s total income.

If we have loan50.csv in our current working directory, we can also use read.csv('loan50.csv'). Otherwise, we can read the data set as follows:

loan <- read.csv(url('https://raw.githubusercontent.com/hungtong/DS-01-101/refs/heads/main/dataset/loan50.csv'))

Once we have the data set in R, we can inspect the data as below.

# overall structure
str(loan)
'data.frame':   50 obs. of  18 variables:
 $ state                  : chr  "NJ" "CA" "SC" "CA" ...
 $ emp_length             : int  3 10 NA 0 4 6 2 10 6 3 ...
 $ term                   : int  60 36 36 36 60 36 36 36 60 60 ...
 $ homeownership          : chr  "rent" "rent" "mortgage" "rent" ...
 $ annual_income          : int  59000 60000 75000 75000 254000 67000 28800 80000 34000 80000 ...
 $ verified_income        : chr  "Not Verified" "Not Verified" "Verified" "Not Verified" ...
 $ debt_to_income         : num  0.558 1.306 1.056 0.574 0.238 ...
 $ total_credit_limit     : int  95131 51929 301373 59890 422619 349825 15980 258439 87705 330394 ...
 $ total_credit_utilized  : int  32894 78341 79221 43076 60490 72162 2872 28073 23715 32036 ...
 $ num_cc_carrying_balance: int  8 2 14 10 2 4 1 3 10 4 ...
 $ loan_purpose           : chr  "debt_consolidation" "credit_card" "debt_consolidation" "credit_card" ...
 $ loan_amount            : int  22000 6000 25000 6000 25000 6400 3000 14500 10000 18500 ...
 $ grade                  : chr  "B" "B" "E" "B" ...
 $ interest_rate          : num  10.9 9.92 26.3 9.92 9.43 ...
 $ public_record_bankrupt : int  0 1 0 0 0 0 0 0 0 1 ...
 $ loan_status            : chr  "Current" "Current" "Current" "Current" ...
 $ has_second_income      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ total_income           : int  59000 60000 75000 75000 254000 67000 28800 80000 34000 192000 ...
# first few rows
head(loan)
  state emp_length term homeownership annual_income verified_income
1    NJ          3   60          rent         59000    Not Verified
2    CA         10   36          rent         60000    Not Verified
3    SC         NA   36      mortgage         75000        Verified
4    CA          0   36          rent         75000    Not Verified
5    OH          4   60      mortgage        254000    Not Verified
6    IN          6   36      mortgage         67000 Source Verified
  debt_to_income total_credit_limit total_credit_utilized
1      0.5575254              95131                 32894
2      1.3056833              51929                 78341
3      1.0562800             301373                 79221
4      0.5743467              59890                 43076
5      0.2381496             422619                 60490
6      1.0770448             349825                 72162
  num_cc_carrying_balance       loan_purpose loan_amount grade interest_rate
1                       8 debt_consolidation       22000     B         10.90
2                       2        credit_card        6000     B          9.92
3                      14 debt_consolidation       25000     E         26.30
4                      10        credit_card        6000     B          9.92
5                       2   home_improvement       25000     B          9.43
6                       4   home_improvement        6400     B          9.92
  public_record_bankrupt loan_status has_second_income total_income
1                      0     Current             FALSE        59000
2                      1     Current             FALSE        60000
3                      0     Current             FALSE        75000
4                      0     Current             FALSE        75000
5                      0     Current             FALSE       254000
6                      0     Current             FALSE        67000
# last few rows
tail(loan)
   state emp_length term homeownership annual_income verified_income
45    CA          5   36      mortgage         87000 Source Verified
46    TX         10   36      mortgage         60000    Not Verified
47    NV          0   36      mortgage         58500        Verified
48    NH          2   36          rent         50000 Source Verified
49    IL          2   36           own        103000    Not Verified
50    TX          9   36      mortgage         77500        Verified
   debt_to_income total_credit_limit total_credit_utilized
45     0.05472414             294946                  4761
46     0.73118333              64017                 43871
47     0.39682051              49234                 23214
48     0.50262000              39814                 25131
49     0.97359223             150027                100280
50     0.67785806             390156                 52534
   num_cc_carrying_balance       loan_purpose loan_amount grade interest_rate
45                       3 debt_consolidation       12800     B          9.44
46                       5   home_improvement        6000     B         10.42
47                       2 debt_consolidation       30000     D         21.45
48                       4 debt_consolidation        5825     B         10.91
49                       4 debt_consolidation       20000     B          9.43
50                       2              other       15000     A          6.08
   public_record_bankrupt loan_status has_second_income total_income
45                      0     Current             FALSE        87000
46                      0     Current             FALSE        60000
47                      0     Current             FALSE        58500
48                      0     Current             FALSE        50000
49                      0     Current             FALSE       103000
50                      0     Current             FALSE        77500
# vector of dimension
dim(loan)
[1] 50 18
# number of rows
nrow(loan)
[1] 50
# number of columns
ncol(loan)
[1] 18
# variable names
colnames(loan)
 [1] "state"                   "emp_length"             
 [3] "term"                    "homeownership"          
 [5] "annual_income"           "verified_income"        
 [7] "debt_to_income"          "total_credit_limit"     
 [9] "total_credit_utilized"   "num_cc_carrying_balance"
[11] "loan_purpose"            "loan_amount"            
[13] "grade"                   "interest_rate"          
[15] "public_record_bankrupt"  "loan_status"            
[17] "has_second_income"       "total_income"           
# row names if any; loan does not have special row names
rownames(loan)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
[46] "46" "47" "48" "49" "50"
# descriptive statistics for each variable (column)
summary(loan)
    state             emp_length          term       homeownership     
 Length:50          Min.   : 0.000   Min.   :36.00   Length:50         
 Class :character   1st Qu.: 2.000   1st Qu.:36.00   Class :character  
 Mode  :character   Median : 5.000   Median :36.00   Mode  :character  
                    Mean   : 4.896   Mean   :42.72                     
                    3rd Qu.: 8.250   3rd Qu.:60.00                     
                    Max.   :10.000   Max.   :60.00                     
                    NA's   :2                                          
 annual_income    verified_income    debt_to_income    total_credit_limit
 Min.   : 28800   Length:50          Min.   :0.05472   Min.   : 15980    
 1st Qu.: 55750   Class :character   1st Qu.:0.26642   1st Qu.: 70526    
 Median : 74000   Mode  :character   Median :0.54047   Median :147364    
 Mean   : 86170                      Mean   :0.72264   Mean   :208547    
 3rd Qu.: 99500                      3rd Qu.:0.74122   3rd Qu.:299766    
 Max.   :325000                      Max.   :5.33373   Max.   :793009    
                                                                         
 total_credit_utilized num_cc_carrying_balance loan_purpose      
 Min.   :  2872        Min.   : 1.00           Length:50         
 1st Qu.: 25694        1st Qu.: 3.00           Class :character  
 Median : 48006        Median : 4.00           Mode  :character  
 Mean   : 61547        Mean   : 5.06                             
 3rd Qu.: 76796        3rd Qu.: 6.00                             
 Max.   :373361        Max.   :14.00                             
                                                                 
  loan_amount       grade           interest_rate   public_record_bankrupt
 Min.   : 3000   Length:50          Min.   : 5.31   Min.   :0.00          
 1st Qu.: 7125   Class :character   1st Qu.: 7.96   1st Qu.:0.00          
 Median :15500   Mode  :character   Median : 9.93   Median :0.00          
 Mean   :17083                      Mean   :11.57   Mean   :0.08          
 3rd Qu.:24000                      3rd Qu.:13.71   3rd Qu.:0.00          
 Max.   :40000                      Max.   :26.30   Max.   :1.00          
                                                                          
 loan_status        has_second_income  total_income   
 Length:50          Mode :logical     Min.   : 28800  
 Class :character   FALSE:42          1st Qu.: 60000  
 Mode  :character   TRUE :8           Median : 78750  
                                      Mean   :105221  
                                      3rd Qu.:119000  
                                      Max.   :325000  
                                                      
# TRUE if an observation has no missing values; loan has no missing values
complete.cases(loan)
 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[37] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[49]  TRUE  TRUE

Storing data with write.table() and write.csv()

  • To export a data frame to .txt or .csv files, we can use the write.table() and write.csv() functions with different delimiters.
write.table(data_to_store, file = 'file_name.txt', sep = '')

write.table(data_to_store, file = 'file_name.txt', sep = '\t')

write.csv(data_to_store, file = 'file_name.csv')

💻 Hands-On

Export loans that are fully paid from the loan50 data to .txt or .csv files.

One example to save the fully paid loans is

loan_paid <- loan[loan$loan_status == 'Fully Paid', ]

write.table(loan_paid, file = 'fully_paid_loan.txt', sep = ',')

write.csv(loan_paid, file = 'fully_paid_loan.csv')

Storing objects with dput()

  • The dput() function writes an R object in a plain-text format that can be used to reproduce the object.
dput(some_object)

💻 Hands-On

Write loans that are fully paid from the loan50 data in a plain-text format.

The output can be ugly but running it in the console will reproduce the object.

loan_paid <- loan[loan$loan_status == 'Fully Paid', ]

dput(loan_paid)
structure(list(state = c("TX", "WI", "WV", "TX", "CA", "MI"), 
    emp_length = c(2L, 10L, 0L, 5L, 7L, 0L), term = c(60L, 60L, 
    36L, 36L, 36L, 36L), homeownership = c("mortgage", "mortgage", 
    "mortgage", "mortgage", "rent", "mortgage"), annual_income = c(98000L, 
    50000L, 80000L, 65000L, 103000L, 85000L), verified_income = c("Verified", 
    "Verified", "Not Verified", "Not Verified", "Source Verified", 
    "Not Verified"), debt_to_income = c(1.10710204081633, 0.523416666666667, 
    0.7045375, 0.242738461538462, 0.265834951456311, 0.818764705882353
    ), total_credit_limit = c(319551L, 375828L, 197299L, 256974L, 
    124900L, 131847L), total_credit_utilized = c(108496L, 87934L, 
    56363L, 15778L, 27381L, 69595L), num_cc_carrying_balance = c(6L, 
    11L, 6L, 7L, 5L, 3L), loan_purpose = c("credit_card", "debt_consolidation", 
    "debt_consolidation", "debt_consolidation", "other", "debt_consolidation"
    ), loan_amount = c(29400L, 40000L, 4500L, 10000L, 30000L, 
    24000L), grade = c("E", "D", "B", "A", "A", "B"), interest_rate = c(24.85, 
    20, 9.44, 6.71, 7.35, 9.93), public_record_bankrupt = c(0L, 
    0L, 0L, 0L, 0L, 0L), loan_status = c("Fully Paid", "Fully Paid", 
    "Fully Paid", "Fully Paid", "Fully Paid", "Fully Paid"), 
    has_second_income = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE
    ), total_income = c(98000L, 168000L, 80000L, 65000L, 103000L, 
    85000L)), row.names = c(16L, 22L, 23L, 27L, 34L, 44L), class = "data.frame")

Storing objects with save()

  • The save() allows us to store multiple objects in .rdata or .rda.
save(object1, object2, object3, file = 'file_name.rdata')

💻 Hands-On

Identify the following objects from the loan50 data and store them in .rdata

  • loan_mortgage - loans associated with homeownership that is mortgage

  • loan_20k - loans with the loan amount at least \(20,000\)

  • loan_low_interest - loans with interest rate less than \(7\%\)

If we run the following code, a special_loans.rdata file will be created in the current working directory. If we share this file with someone else and they double-click it, the three saved objects loan_mortgage, loan_20k, and loan_low_interest will be made ready on their end.

loan_mortgage <- loan[loan$homeownership == 'mortage', ]

loan_20k <- loan[loan$loan_amount >= 20000, ]

loan_low_interest <- loan[loan$interest_rate < 7, ]

save(loan_mortgage, loan_20k, loan_low_interest, file = 'special_loans.rdata')

Miscellaneous functions

  • ls() shows all the created objects

  • rm() removes certain objects

  • getwd() returns the file path of the current working directory

  • setwd() sets a new working directory